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Abstract 

In a wide range of statistical learning problems such as ranking, clustering or metric learning 
among others, the risk is accurately estimated by U-statistics of degree d > 1, i.e. function¬ 
als of the training data with low variance that take the form of averages over k-tuples. From 
a computational perspective, the calculation of such statistics is highly expensive even for a 
moderate sample size n, as it requires averaging 0 {ti‘^) terms. This makes learning procedures 
relying on the optimization of such data functionals hardly feasible in practice. It is the major 
goal of this paper to show that, strikingly, such empirical risks can be replaced by drastically 
computationally simpler Monte-Carlo estimates based on 0(n) terms only, usually referred to as 
incomplete U-statistics, without damaging the Op(l /\/n) learning rate of Empirical Risk Mini¬ 
mization (ERM) procedures. For this purpose, we establish uniform deviation results describing 
the error made when approximating a U-process by its incomplete version under appropriate 
complexity assumptions. Extensions to model selection, fast rate situations and various sam¬ 
pling techniques are also considered, as well as an application to stochastic gradient descent for 
ERM. Finally, numerical examples are displayed in order to provide strong empirical evidence 
that the approach we promote largely surpasses more naive subsampling techniques. 


1 Introduction 


In classification/regression, empirical risk estimates are sample mean statistics and the theory of 


Empirical Risk Minimization (ERM) has been originally developed in this context, see Devroye 


et al. (1996). The ERM theory essentially relies on the study of maximal deviations between these 


empirical averages and their expectations, under adequate complexity assumptions on the set of 
prediction rule candidates. The relevant tools are mainly concentration inequalities for empirical 
processes, see 


Ledoux and Talagrand (1991) for instance. 


In a wide variety of problems that received a good deal of attention in the machine learning 
literature and ranging from clustering to image recognition through ranking or learning on graphs, 
natural estimates of the risk are not basic sample means but take the form of averages of d-tuples. 


usually referred to as U-statistics in Probability and Statistics, see Lee (1990). In Clemenqon 


et al. (2005) for instance, ranking is viewed as pairwise classification and the empirical ranking 


error of any given prediction rule is a U-statistic of order 2, just like the within cluster point 
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scatter in cluster analysis (see Clemengon, 2014) or empirical performance measures in metric 
learning, refer to Cao et al. (2012) for instance. Because empirical functionals are computed 


by averaging over tuples of sampling observations, they exhibit a complex dependence structure, 
which appears as the price to be paid for low variance estimates. Linearization techniques (see 


Hoeffding, 1948) are the main ingredient in studying the behavior of empirical risk minimizers in 


this setting, allowing to establish probabilistic upper bounds for the maximal deviation of collection 
of centered U-statistics under appropriate conditions by reducing the analysis to that of standard 
empirical processes. However, while the ERM theory based on minimization of U-statistics is 


now consolidated (see Clemengon et al., 2008), putting this approach in practice generally leads 


to significant computational difficulties that are not sufficiently well documented in the machine 
learning literature. In many concrete cases, the mere computation of the risk involves a summation 
over an extremely high number of tuples and runs out of time or memory on most machines. 

Whereas the availability of massive information in the Big Data era, which machine learning 
procedures could theoretically now rely on, has motivated the recent development of parallelized / 


distributed approaches in order to scale-up certain statistical learning algorithms, see Bekkerman 


et al. (2011) or Bianchi et al. (2013) and the references therein, the present paper proposes to use 


sampling techniques as a remedy to the apparent intractability of learning from data sets of explosive 
size, in order to break the current computational barriers. More precisely, it is the major goal of this 
article to study how a simplistic sampling technique {i.e. drawing with replacement) applied to risk 
estimation, as originally proposed by Blom ( ]1976 ) in the context of asymptotic pointwise estimation, 
may efficiently remedy this issue without damaging too much the “reduced variance” property of 
the estimates, while preserving the learning rates (including certain ”fast-rate” situations). For 
this purpose, we investigate to which extent a U-process, that is a collection of U-statistics, can 
be accurately approximated by a Monte-Carlo version (which shall be referred to as an incomplete 
U-process throughout the paper) involving much less terms, provided it is indexed by a class of 
kernels of controlled complexity (in a sense that will be explained later). A maximal deviation 
inequality connecting the accuracy of the approximation to the number of terms involved in the 
approximant is thus established. This result is the key to the analysis of the statistical performance 
of minimizers of risk estimates when they are in the form of an incomplete U-statistic. In particular, 
this allows us to show the advantage of using this specific sampling technique, compared to more 
naive approaches with exactly the same computational cost, consisting for instance in first drawing 
a subsample and then computing a risk estimate of the form of a (complete) U-statistic based on it. 
We also show how to incorporate this sampling strategy into iterative statistical learning techniques 
based on stochastic gradient descent (SGD), see Bottou (1998). The variant of the SGD method we 
propose involves the computation of an incomplete U-statistic to estimate the gradient at each step. 
For the estimator thus produced, rate bounds describing its statistical performance are established 
under mild assumptions. Beyond theoretical results, we present illustrative numerical experiments 
on metric learning and clustering with synthetic and real-world data that support the relevance of 
our approach. 

The rest of the article is organized as follows. In Section we recall basic definitions and con¬ 
cepts pertaining to the theory of U-statistics/processes and present important examples in machine 
learning where natural estimates of the performance/risk measure are U-statistics. We then review 
the existing results for the empirical minimization of complete U-statistics. In Section we recall 
the notion of incomplete U-statistic and we derive maximal deviation inequalities describing the 
error made when approximating a U-statistic by its incomplete counterpart uniformly over a class 
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of kernels that fulfills appropriate complexity assumptions. This result is next applied to derive 
(possibly fast) learning rates for minimizers of the incomplete version of the empirical risk and to 
model selection. Extensions to incomplete U-statistics built by means of other sampling schemes 
than sampling with replacement are also investigated. In Section]^ estimation by means of incom¬ 
plete U-statistics is applied to stochastic gradient descent for iterative ERM. Section presents 
some numerical experiments. Finally, Section [^collects some concluding remarks. Technical details 
are deferred to the Appendix. 


2 Background and Preliminaries 


As a first go, we briefly recall some key notions of the theory of U-statistics (Section 2.1) and provide 
several examples of statistical learning problems for which natural estimates of the performance/risk 
measure are in the form of U-statistics (Section |2.2[ ). Finally, we review and extend the existing rate 
bound analysis for the empirical minimization of (complete) generalized U-statistics (Section |2.3[ ). 
Here and throughout, N* denotes the set of all strictly positive integers, M+ the set of nonnegative 
real numbers. 


2.1 U-Statistics/Processes: Definitions and Properties 


For clarity, we recall the definition of generalized U-statistics. An excellent account of properties 
and asymptotic theory of U-statistics can be found in Lee (1990). 


Definition 1. (Generalized U-statistic) Let K > 1 and (di, ..., dK) = 

..., 1 < k < K, 6e K independent samples of sizes > dk and composed of i.i.d. 

random variables taking their values in some measurable space Ak wUh distribution Fk(dx) respec¬ 
tively. Let H : x • • • x X^'^ —> M 6e a measurable function, square integrable with respect to the 

probability distribution p. = (8) • • • <8) F®‘^'^. Assume in addition (without loss of generality) that 

H ..., is symmetric within each block of arguments x^'^^ (valued in X^'^), 1 < k < K. 

The generalized (or K-sample) M-statistic of degrees (di, ..., dk) with kernel H, is then defined as 


Un(H) 



x<f. 

b Ik 



( 1 ) 


where the symbol refers to summation over all (((J)) subsets = (x|^\ ..., x|^^ ) related to 
a set Ik of dk indexes 1 < ii < ... < idk < TLk and n = (ni, ..., rk) . 

The above definition generalizes standard sample mean statistics, which correspond to the case 
K = 1 = di. More generally when K = 1, Un(H) is an average over all di-tuples of observations, 
while K > 2 corresponds to the multi-sample situation with a dk-tuple for each sample k G (1,..., K}. 
A U-process is defined as a collection of U-statistics indexed by a set LL of kernels. This concept 
generalizes the notion of empirical process. 

Many statistics used for pointwise estimation or hypothesis testing are actually generalized 
U-statistics (e.g. the sample variance, the Gini mean difference, the Wilcoxon Mann-Whitney 
statistic, Kendall tau). Their popularity mainly arises from their “reduced variance” property: the 
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statistic Un(H) has minimum variance among all unbiased estimators of the parameter 


p(H) = E 


H(xj^ 


X 


( 1 ) 

di > 


X 


(K) 


X 


dK J 


( 2 ) 






H(x 


( 1 ) 


X' 


(K)-jdp®di 


X' 


.dF®dK 


X 


(K)1 _ 


= E[U„(H)]. 


Classically, the limit properties of these statistics (law of large numbers, central limit theorem, 
etc.) are investigated in an asymptotic framework stipulating that, as the size of the full pooled 
sample 

def , . 

n = rii + ... + riK (3) 

tends to infinity, we have: 

Tiic/n —> Ak > 0 for k = 1, ..., K. (4) 

Asymptotic results and deviation/moment inequalities for K-sample U-statistics can be classically 


established by means of specific representations of this class of functionals, see (15) and (27) in¬ 


troduced in later sections. Significant progress in the analysis of U-statistics and U-processes has 


then recently been achieved by means of decoupling theory, see de la Pena and Cine (1999). For 


completeness, we point out that the asymptotic behavior of (multisample) U-statistics has been 
investigated under weaker integrability assumptions than that stipulated in Definition see Lee 

(TgMl). 


2.2 Motivating Examples 

In this section, we review important supervised and unsupervised statistical learning problems 
where the empirical performance/risk measure is of the form of a generalized U-statistics. They 
shall serve as running examples throughout the paper. 


2.2.1 Clustering 


Clustering refers to the unsupervised learning task that consists in partitioning a set of data points 
Xi, ..., Xn in a feature space A into a finite collection of subgroups depending on their similarity 
(in a sense that must be specified): roughly, data points in the same subgroup should be more 
similar to each other than to those lying in other subgroups. One may refer to Chapter 14 in 


Friedman et al. (2009) for an account of state-of-the-art clustering techniques. Formally, let M > 2 
be the number of desired clusters and consider a symmetric function D : A x A —> M+ such that 
D (x, x) = 0 for any x G A. D measures the dissimilarity between pairs of observations (x, x^ G A^: 
the larger D(x, x'), the less similar x and xh For instance, if A C D could take the form 
D(x,x') = A(||x — x'llq), where q > 1, ||a||q = for all a G and A : M+ —> M+ 

is any borelian nondecreasing function such that T(0) = 0. In this context, the goal of clustering 
methods is to find a partition V of the feature space A in a class Ff of partition candidates that 
minimizes the following empirical clustering risk: 


WniV) = 


nfn — 1 


Y_ D(X,,Xj)-(I)p(Xt,Xj), 


(5) 


l<i<j<n 


where 0'p(x,x') = I[{(x,x') G C^}. Assuming that the data Xp ..., Xn are i.i.d. realizations 

of a generic random variable X drawn from an unknown probability distribution F(dx) on A, the 


4 

















quantity Wn('P), also known as the intra-cluster similarity or within cluster point scatter, is a one 
sample U-statistic of degree two (K = 1 and di = 2) with kernel given by: 

y[x,x') e Hp(x,x'] = D(x,x') • (6) 

according to Dehnition provided that J x') • (I)'p(x,x')F(dx)F(dx') < +oo. The 

expectation of the empirical clustering risk Wn('P) is given by 


W{V) = E [D(X, X') • OriX, X')] , 


(7) 


where X' is an independent copy of the r.v. X, and is named the clustering risk of the partition 
V. The statistical analysis of the clustering performance of minimizers Vn of the empirical risk Q 
over a class FT of appropriate complexity can be found in Clemenqon (2014). Based on the theory of 
U-processes, it is shown in particular how to establish rate bounds for the excess of clustering risk of 
any empirical minimizer, W('PTi)~iiif'Pgn W('P) namely, under appropriate complexity assumptions 
on the cells forming the partition candidates. 


2.2.2 Metric Learning 


Many problems in machine learning, data mining and pattern recognition (such as the clustering 
problem described above) rely on a metric to measure the distance between data points. Choosing 
an appropriate metric for the problem at hand is crucial to the performance of these methods. 
Motivated by a variety of applications ranging from computer vision to information retrieval through 
bioinformatics, metric learning aims at adapting the metric to the data and has attracted a lot of 
interest in recent years (see for instance Bellet et ah, 2013, for an account of metric learning 


and its applications). As an illustration, we consider the metric learning problem for supervised 
classification. In this setting, we observe independent copies (Xi, Yi ],..., (Xn, Yn) of a random 
couple (X, Y], where the r.v. X takes values in some feature space X and Y in a finite set of labels, 
T = {1, • ■ ■, C} with C > 2 say. Consider a set V of distance measures D : A x A —> M+. Roughly 
speaking, the goal of metric learning in this context is to find a metric under which pairs of points 
with the same label are close to each other and those with different labels are far away. The risk 
of a metric D can be expressed as: 


R(D] = E [4) ((1 - D(X, X'] • (2I{Y = Y'} - 1))] , (8) 

where 4^(u) is a convex loss function upper bounding the indicator function I{u > 0}, such as the 
hinge loss 4^(u) = max(0, 1 — u). The natural empirical estimator of this risk is 

Rn(D) = —y 4)((D(Xt,Xj)-l).(2I{Y, = Yj}-l)), (9) 

which is a one sample U-statistic of degree two with kernel given by: 


Hd ((>^,y), (^',9)) = ct* ((D(x,x') - 1) • (21{y =y'}- 1)) . 


( 10 ) 


The convergence to ([8 ) of a minimizer of ([^ has been studied in the frameworks of algorithmic 
stability (Jin et ah, 2009), algorithmic robustness ( Bellet and Habrard[ 2015) and based on the 
theory of U-processes under appropriate regularization (Cao et ah, 2012). 
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2.2.3 Multipartite Ranking 


Given objects described by a random vector of attributes/features X € X and the (temporarily 
hidden) ordinal labels Y G {1, ..., K} assigned to it, the goal of multipartite ranking is to rank 
them in the same order as that induced by the labels, on the basis of a training set of labeled 
examples. This statistical learning problem finds many applications in a wide range of fields {e.g. 
medicine, finance, search engines, e-commerce). Rankings are generally defined by means of a 
scoring function s : T —> M, transporting the natural order on the real line onto the feature space 
and the gold standard for evaluating the ranking performance of s(x) is the ROC manifold, or 
its usual summary the VUS criterion (VUS standing for Volume Under the ROC Surface), see 


Clemengon and Robbiano (2014) and the references therein. In Clemengon et al. (2013), optimal 


scoring functions have been characterized as those that are optimal for all bipartite subproblems. 
In other words, they are increasing transforms of the likelihood ratio dFi^+i/dF]^, where F]^ denotes 
the class-conditional distribution for the k-th class. When the set of optimal scoring functions 
is non-empty, the authors also showed that it corresponds to the functions which maximize the 
volume under the ROC surface 


VUS(s] =P{s(Xi) <...<s(Xk]|Yi =1, Yk = K}. 

Given K independent samples (x|'^\ ..., Fk(dx) for k = 1, ..., K, the empirical counter¬ 

part of the VUS can be written in the following way: 

, n, Uk 

ws(s) = — Z • • ■ Z <... < s(x[|^’)}. (11) 

llk=l^ki,=l 1^=1 

The empirical VUS ( [IT| ) is a K-sample U-statistic of degree (1, ..., 1) with kernel given by: 

Fls(xi, ..., xk) =I{s(xi) < ... < s(xk]}. (12) 


2.3 Empirical Minimization of U-Statistics 

As illustrated by the examples above, many learning problems can be formulated as finding a 
certain rule g in a class Q in order to minimize a risk of the same form as ([^, with kernel 

H = Hg. Based on K > 1 independent i.i.d. samples 

..., n,} = (^1"^ • • • > With 1 < k < K, 

the ERM paradigm in statistical learning suggests to replace the risk by the U-statistic estimation 
Un(Flg) in the minimization problem. The study of the performance of minimizers gn of the empir¬ 
ical estimate Un(Hg) over the class G of rule candidates naturally leads to analyze the fluctuations 
of the U-process 

{U„(Hg)-p(Hg) : g eg}. (13) 

Given the bound 

M-(Hg„) - inf p(Hg) < 2sup|Un(Hg) - |J.(Hg]|, (14) 

966 gee 

a probabilistic control of the maximal deviation supggg |Un(Flg) — |j.(Hg]| naturally provides statis¬ 
tical guarantees for the generalization ability of the empirical minimizer gn. As shown at length in 
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the case K = 1 and di = 2 in Clemengon et al. (2008) and in Clemengon] (2014) for specific prob¬ 
lems, this can be achieved under adequate complexity assumptions of the class Hg = {Hg : g G G}. 
These results rely on the Hoeffding’s representation of U-statistics, which we recall now for clarity 
in the general multisample U-statistics setting. Denote by Sm the symmetric group of order m for 
any m > 1 and by a(i) the i-th coordinate of any permutation a G Sra for 1 < i < m. Let [zj be 
the integer part of any real number z and set 

N = min{[ni/dij, [riK/dKj}. 

Observe that the K-sample U-statistic (Q can be expressed as 


1 lk=l CTiSSn 


Z 


Vh X, 


41) 

V,(l)’ 


where 


Vh(xS^ 


X 


( 1 ) 

n, ) 


,, 


(K) 


y(k) 
• ) ^riK 


1 

N 


+ H X 


41) 

"di+l’ 


yd) 

■> '^2d, ’ 


H [X 
X 


( 1 ) 

1 ) 

(K) 


dK+1’ 


yd) 

■ ’ -^di > 

y(K) 

■ ■ ’ '^2dK 


y(K) 

’ "^O'K(nK) 


y(K) 

> , 


(15) 


y(K) 


-d... 


+ H X 


41) 

^(N-l)di+l’ 


yd) 

■ ’ '^Nd, ’ 


X 


(K) 

(N-l)dK + l’ 


y(K) 

' ’ '^NdK 


This representation, sometimes referred to as the first Hoeffding’s decomposition (see Hoeffding 


1948), allows to reduce a first order analysis to the case of sums of i.i.d. random variables. The 


following result extends Corollary 3 in Clemengon et al. (2008) to the multisample situation. 
Proposition 1. Let H he a collection of bounded symmetric kernels on ](([k=i '^k'^ such that 

|H(x)| < - 1 - 00 . (16) 


A A def 

Mn = 


sup 
lH,x)£nxX 


Suppose also that H is a VC major class of functions with finite Vapnik-Chervonenkis dimension 
V < -Hoo. For all 5 G (0,1), we have with probability at least 1 — 6, 


/2Vlog(l +N) 


log(1/5) I 


N 


(17) 


sup |Un(H) - p(H)| < Mn < 2a/+ 

H&H 

where N = min{[ni/dij, ..., [riK/dKj}. 

Observe that, in the usual asymptotic framework (Q, the bound 0 shows that the learning 
rate is, as expected, of order Op(y^logn/n], where n denotes the size of the pooled sample. 


Remark 1. (Uniform boundedness) We point out that condition (16) is clearly satisfied for 
the class of kernels considered in the multipartite ranking situation, whatever the class of scoring 
functions considered. In the case of the clustering example, it is fulfilled as soon as the essential 
supremum o/D(X,X'] -O-p/X, X') is uniformly bounded overV G IT, whereas in the metric learning 
example, it is satisfied when the essential supremum of the r.v. 4)((D(X, X') — 1) • (2I{Y = Y'}— 1)) is 
uniformly hounded over D G D. We underline that this simplifying condition can be easily relaxed 
and replaced by appropriate tail assumptions for the variables ..., H G 77, combining 

the arguments of the subsequent analysis with the classical “truncation trick” originally introduced 


in Fuk and Nagaev (1971). 
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Remark 2. (Complexity assumptions) Following in the footsteps of Clemengon et al. (2008) 


which considered 1 -sample U-statistics of degree 2, define the Rademacher average 


7^N = sup — 
H&n 




X 


( 1 ) 


X 


( 1 ) 

Idi ’ 


y(K) 

’ ^(l-l)dK + l’ 


X 


(K) 

IdK 


1=1 


(18) 


where ei, En are independent Rademacher random variables (random symmetric sign vari¬ 

ables), independent from the xj*^^ ’s. As can be seen by simply examining the proof of Proposition 
(Appendix^^, a control of the maximal deviations similar to ( [Xt] ) relying on this particular complex¬ 
ity measure can he obtained: the first term on the right hand side is then replaced by the expectation 
of the Rademacher average E[7^im], up to a constant multiplicative factor. This expected value can 
he bounded by standard metric entropy techniques and in the case where FL is a VC major class of 
functions of dimension V, we have: 


E[7^n] < Ady. 


2viog(N + r 


N 


See Appendix PX] for further details. 


3 Empirical Minimization of Incomplete U-Statistics 

We have seen in the last section that the empirical minimization of U-statistics leads to a learning 
rate of Op(y^log tx/ti). However, the computational cost required to find the empirical minimizer 
in practice is generally prohibitive, as the number of terms to be summed up to compute the 
U-statistic Q is equal to: 

' n ^\ /tlk' 

a) Uk. 

In the usual asymptotic framework it is of order j ^s n —> +oo. It is the major 

purpose of this section to show that, in the minimization problem, the U-statistic Un(Hg) can be 
replaced by a Monte-Carlo estimation, referred to as an incomplete M-statistic, whose computation 


requires to average much less terms, without damaging the learning rate (Section 3.1). We fur¬ 


ther extend these results to model selection (Section 3.2), fast rates situations (Section 3.3) and 
alternative sampling strategies (Section |3.4[). 


3.1 Uniform Approximation of Generalized U-Statistics 

As a remedy to the computational issue mentioned above, the concept of incomplete generalized 
U-statistic has been introduced in the seminal contribution of Blom ( 1976[ ). The calculation of 
such a functional involves a summation over low cardinality subsets of the (((J() d^-tuples of indices, 
1 < k < K, solely. In the simplest formulation, the subsets of indices are obtained by sampling 
independently with replacement, leading to the following definition. 

Definition 2. (Incomplete Generalized U-statistic) Let B > 1. The incomplete version of 
the \X-statistic Q based on B terms is defined by: 


Ub(H) = 1 


L 


HfX 


( 1 ) 




i=(i,, ...,iK)eDB 


le^B 


(19) 


























Sample of n=7 
observations 


Subsample of 
m = 4 observations 


Set of 
B = 6 pairs 




Figure 1: Illustration of the difference between an incomplete U-statistic and a complete U-statistic 
based on a subsample. For simplicity, we focus on the case K = 1 and di = 2. In this simplistic 
example, a sample of n = 7 observations is considered. To construct a complete U-statistic of 
reduced complexity, we first sample a set of m = 4 observations and then form all possible pairs 
from this subsample, i.e. B = m(m —1)/2 = 6 pairs in total. In contrast, an incomplete U-statistic 
with the same number of terms is obtained by sampling B pairs directly from the set A of all 
possible pairs based on the original statistical population. 


where T)-q is a set of cardinality B built by sampling with replacement in the set 


A = 1^’), dS 




i-d'!’)) : 1 < < • • • < < TLk, 1 < k < K}, 






( 20 ) 


an(iXi = (xjJ\ for alll = {h, Ir) GA. 


We stress that the distribution of a complete U-statistic built from subsamples of reduced sizes 
n(, drawn uniformly at random is quite different from that of an incomplete U-statistic based on 
B = nLi ( ^J^) terms sampled with replacement in A, although they involve the summation of the 
same number of terms, as depicted by Fig. 

In practice, B should be chosen much smaller than the cardinality of A, namely ffA = ])([k=i (dj)); 
in order to overcome the computational issue previously mentioned. We emphasize the fact that the 
cost related to the computation of the value taken by the kernel H at a given point ..., 

depending on the form of H is not considered here: the focus is on the number of terms involved in 


the summation solely. As an estimator of the statistic (19) is still unbiased, i.e. E[Ub(H)] = 


but its variance is naturally larger than that of the complete U-statistic Un(H). Precisely, 
writing the variance of the r.v. Ub(H) as the expectation of its conditional variance given (Xi)igA 
plus the variance of its conditional expectation given (Xi)igA) we obtain 

Var(UB(H]) = (l “ Var(Un(H)) + lvar(H(x'B, ..., (21) 

One may easily check that Var(UB(H]) > Var(Un(H)), and the difference vanishes as B increases. 


Refer to Lee (1990) for further details (see p. 193 therein). Incidentally, we underline that the 


empirical variance of (19) is not easy to compute either since it involves summing approximately 


ffA terms and bootstrap techniques should be used for this purpose, as proposed in Bert ail and 
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Tressou (2006). The asymptotic properties of incomplete U-statistics have been investigated in 


several articles, see Janson (1984); Brown and Kildea (1978); Enqvist (1978). The angle embraced 
in the present paper is of very different nature: the key idea we promote here is to use incomplete 
versions of collections of U-statistics in learning problems such as that described in Section |2.2[ 
The result stated below shows that this approach solves the numerical problem, while not damaging 
the learning rates under appropriate complexity assumptions on the collection % of (symmetric) 
kernels H considered, the complexity being described here in terms of VC dimension for simplicity. 
In particular, it reveals that concentration results established for U-processes (i.e. collections of U- 
statistics) such as Propositionmay extend to their incomplete versions, as shown by the following 
theorem. 


Theorem 1. (Maximal deviation) Let % he a collection of hounded symmetric kernels on 
Ok=i that fulfills the assumptions of Proposition^ Then, the following assertions hold true. 

(i) For all 5 G (0,1), with prohability at least 1—6, we have: Vn = (ni, ..., tlk ) G VB > 1, 


sup 

He-H 


Ub(H) 


Un(H) 


< Mu X 



Vlog(1 + #A) + log(2/5) 

B 


(a) For all 6 G (0,1), with probability at least ^ — 8, we have: Vn G VB > 1, 


1 


TT~ 

■i^n H&n 


Ub(H)-p(H) 


< 2 


2Vlog(l +N) ^ /log(2/6) 


N 


N 


Vlog(l + #A) -I- log(4/5) 


B 


where N = min{ [ni /di J,..., [TXx/dKj}. 

Remark 3. (Complexity assumptions continued) We point out that a hound of the same 
order as that stated above can he obtained under standard metric entropy conditions by means of 
classical chaining arguments, or under the assumption that the Rademacher average defined by 


TT-b = sup - 
H&n D 


Cb(I)H(Xi) 

b=l I ISA 


( 22 ) 


has an expectation of the order 0(1/\/B). The quantity Cb(I) indicates whether the subset of indexes 
1 has been picked at the b-th draw ("Cb(I) = ) or not ('Cb(I) = 0), see the calculation at the end 

of Appendix Equipped with this notation, notice that the fb’s are i.i.d. multinomial random 
variables such that ^b(I] = +1 • This assumption can he easily shown to he fulfilled in the case 

where TL is a VC major class of finite VC dimension (see the proof of Theorem^ in Appendix\^. 
Notice however that although the variables Cb(I)U(Xi), 1 < b < B, are conditionally i.i.d. 

given (Xi)igA; they are not independent and the quantity (22) cannot he related to complexity 
measures of the type (18) mentioned in Remark^ 


Remark 4. We underline that, whereas sup^g-^ |Un(H)— |j,(H)| can he proved to he of order Op(l /n) 
under adequate complexity assumptions in the specific situation where {Un(H) : H G 77} is a 
collection of degenerate U-statistics (see Section^S^, the hound (i) in Theorem^cannot he improved 
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in the degenerate case. Observe indeed that, conditioned upon the observations X| , the deviations 
of the approximation (19) from its mean are of order Op(l/\/B), since it is a basic average 0 /B 
i.i.d. terms. 

From the theorem stated above, one may straightforwardly deduce a bound on the excess risk 
of kernels Hb minimizing the incomplete version of the empirical risk based on B terms, i.e. such 
that 

Ub (Hb) =minUB(H). (23) 

V / H&H 

Corollary 1. Let % be a collection of symmetric kernels on Ok=l '^k'^ satisfies the conditions 
stipulated in Proposition^^ Let 6 > 0. For any minimizer Hb of the statistical estimate of the risk 
, the following assertions hold true 

(i) We have with probability at least 1 — 5; Vn G VB > 1, 


M,(Hb ) - ^n^ < 2Mn x 


^y 2Vlog(1+N) ^ 


log(2/6) , /^Vlogd +#A) + log(4/5)| 
+ -B-1 


(ii) We have: Vn G VB > 1, 


E 


sup 

Hew 


Ub(H)-p(H) 


< Mn 



N 


+ 


2Vlog(l+N) /2(log2 + Vlog(l+#A)) 


B 


The first assertion of Theorem provides a control of the deviations between the U-statistic 
(|^ and its incomplete counterpart ( |19[ ) uniformly over the class FL. As the number of terms 
B increases, this deviation decreases at a rate of 0(1/\/B). The second assertion of Theorem 
gives a maximal deviation result with respect to Observe in particular that, with the 

asymptotic settings previously specified, N = 0(n] and log(^A) = O(logn) as n —> + 00 . The 
bounds stated above thus show that, for a number B = B^ of terms tending to infinity at a rate 
0(n) as n —> + 00 , the maximal deviation supng-^ |blB(H) — |a.(H]| is asymptotically of the order 
Op((log(n]/n)^/^), just like sup^g^ |Hn(H) — see bound 0 in Proposition In short, 

when considering an incomplete ll-statistic (19) with B = O(rL) terms only, the learning rate for 


the corresponding minimizer is of the same order as that of the minimizer of the complete risk (Q, 
whose computation requires to average ffA = terms. Minimizing such incomplete 

U-statistics thus yields a significant gain in terms of computational cost while fully preserving the 
learning rate. In contrast, as implied by Proposition!^ the minimization of a complete U-statistic 
involving 0(n) terms, obtained by drawing subsamples of sizes n(, = j uniformly at 

random, leads to a rate of convergence of 0(\/log(n]/TiPH'+-+‘^'<)]^ which is much slower except 
in the trivial case where K = 1 and di = 1. These striking results are summarized in Table 

The important practical consequence of the above is that when n is too large for the complete 
risk 0 to be used, one should instead use the incomplete risk ( |19[ ) (setting the number of terms B 
as large as the computational budget allows). 
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Empirical risk criterion 

Nb of terms 

Rate bound 

Complete U-statistic 

Q(^dl+...+dKj 

Op(v^log(n)/n) 

Complete U-statistic based on subsamples 

0(n] 

Op ^\/log(n)/n‘ii+-+‘iK j 

Incomplete U-statistic (our result) 

0(u) 

Op(v^log(n)/n] 


Table 1: Rate bound for the empirical minimizer of several empirical risk criteria versus the number 
of terms involved in the computation of the criterion. For a computational budget of 0(n] terms, 
the rate bound for the incomplete U-statistic criterion is of the same order as that of the complete 
U-statistic, which is a huge improvement over a complete ll-statistic based on a subsample. 


3.2 Model Selection Based on Incomplete U-Statistics 


Automatic selection of the model complexity is a crucial issue in machine learning: it includes 
the number of clusters in cluster analysis (see Clemengon 2014) or the choice of the number of 


possible values taken by a piecewise constant scoring function in multipartite ranking for instance 
(c/. Clemengon and Vayatis, 2009). In the present situation, this boils down to choosing the 


adequate level of complexity of the class of kernels V., measured through its (supposedly finite) VC 
dimension for simplicity, in order to minimize the (theoretical) risk of the empirical minimizer. It is 


the purpose of this subsection to show that the incomplete U-statistic (19) can be used to define a 


penalization method to select a prediction rule with nearly minimal risk, avoiding procedures based 


on data sp 


see 


Vapnik 


infHew b(H 


i tting/ resampling and extending the celebrated structural risk minimization principle, 
(1999). Let Ti be the collection of all symmetric kernels on Ok=l p.* = 

. be a sequence of uniformly bounded major subclasses of H, of 


Let 

increasing complexity (VC dimension). For any m > 1, let Vra denote the VC dimension of the 
class Ffra and set M-Um — sap(Hxje'HmxA’ |H(^)I < +oo. We suppose that there exists M. < +oo 
such that sup,T^>i M-Um ^ Given 1 < B < and m > 1, the complexity penalized empirical 
risk of a solution Ub, 


of the ERM problem (23) with T-L = Tim. is 


blB(HB,ra) +pen(B,m), 

where the quantity pen(B,Ta] is a distribution free penalty given by: 


(24) 


pen(B,m) = 2Mn„ 


+ 2M 


/2 V^log(1^ ^ y 2aog2 + V,^log(1 +#A)y j 

(B -|- n) logm 
B2 ■ 


(25) 


As shown in Assertion (if) of Corollary the quantity above is an upper bound for the expected 
maximal deviation E[supBig-^^ |Ub(H) — p(H)|] and is thus a natural penalty candidate to compen¬ 
sate the overfitting within class Ffra- We thus propose to select 


itlb = 


argmin|UB(HB,m) +pen(B,m)| . 
m>l J 


(26) 
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As revealed by the theorem below, choosing B = O(rL), the prediction rule based on a penalized 
criterion involving the summation of O (n] terms solely, achieves a nearly optimal trade-off between 


the bias and the distribution free upper bound (25) on the variance term 


Theorem 2. (Oracle inequality) Suppose that Theorem^’s assumptions are fulfilled for all 
m > 1 and that supg^>-| Ad-Hm < Ad < +oo. Then, we have: Vn G VB G {1, ..., #A}, 


“ h* < inf < inf — p,*-|-pen(B, m) + Ad 

1 I 


-b 

B 


n 


We point out that the argument used to obtain the above result can be straightforwardly 
extended to other (possibly data-dependent) complexity penalties (c/. [Massart 2006), see the 
proof in Appendix [D| 


3.3 Fast Rates for ERM of Incomplete U-Statistics 


In Clemengon et ah (2008), it has been proved that, under certain “low-noise” conditions, the 


minimum variance property of the U-statistics used to estimate the ranking risk (corresponding 
to the situation K = 1 and di = 2) leads to learning rates faster than Op(l/-v/rL). These results 


rely on the Hajek projection, a linearization technique originally introduced in Hoeffding (1948) 


for the case of one sample U-statistics and next extended to the analysis of a much larger class of 


functionals in Hajek (1968). It consists in writing Un(H) as the sum of the orthogonal projection 


K nic 


Un(H) 


U„(H] I X 


(k) 


- (tl- 


(27) 


k=l i=l 


which is itself a sum of K independent basic sample means based on i.i.d. r.v.’s (of the order 
Op(l /^/rL) each, after recentering), plus a possible negligible term. This representation was used for 


instance by Grams and Serfling (1973) to refine the CLT in the multisample U-statistics framework. 


Although useful as a theoretical tool, it should be noticed that the quantity Un(H) is not of practical 
interest, since the conditional expectations involved in the summation are generally unknown. 


Although incomplete U-statistics do not share the minimum variance property (see Section 3.1), 


we will show that the same fast rate bounds for the excess risk as those reached by ERM of U- 
statistics (corresponding to the summation of O(n^) pairs of observations) can be attained by 
empirical ranking risk minimizers, when estimating the ranking risk by incomplete U-statistics 
involving the summation of o(n^) terms solely. 

For clarity (and comparison purpose), we first recall the statistical learning framework consid¬ 


ered in Clemengon et al. (2008). Let (X, Y) be a pair of random variables defined on the same 


probability space, where Y is a real-valued label and X models some input information taking its 
values in a measurable space A hopefully useful to predict Y. Denoting by (X', Y') an independent 
copy of the pair (X, Y). The goal pursued here is to learn how to rank the input observations X and 
X', by means of an antisymmetric ranking rule r : —) {—1, -1-1} (i.e. r(x, x') = —r(x'x) for any 
(x, xO G X^), so as to minimize the ranking risk 


L(r) =E{(Y-Y') •r(X,X') < 0). 


(28) 


The minimizer of the ranking risk is the ranking rule r*(X,X'] = 2I{P{Y > Y' | (X,X']} > E{Y < 
y I (X,X')}— 1 (see Proposition 1 in Clemengon et ah, 2008). The natural empirical counterpart 
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of (28) based on a sample of independent copies (XpYi), (XnjYn) of the pair (X,Y) is the 
1-sample U-statistic Un(Hr) of degree two with kernel Hr((x,y), “UO • t’(x, x') < 0} 

for all (x,y) and (x',y')) in Af x M given by: 


Ln(r) = = 


nfn — 1 


^I{(Yi-Yj)-r(Xi,Xj)<0}. 


(29) 


>•<) 


Equipped with these notations, a statistical version of the excess risk A(r) = L(r) — L(r*) is a 
U-statistic An(t) with kernel qr = Hr —Hr*. The key “noise-condition”, which allows to exploit the 
Hoeffding/Hajek decomposition of An.(r), is stated below. 

Assumption 1. There exist constants c > 0 and a G [0,1] sueh that: 

Vr G Var (hr(X, Y)) < cA(t)“, 
where we set hr(x,y) = E[qr((x,y), (X', Y')]. 

Recall incidentally that very general sufficient conditions guaranteeing that this assumption 
holds true have been exhibited, see Section 5 in Clemengon et al. (2008) (notice that the condition 
is void for a = 0). Since our goal is to explain the main ideas rather than achieving a high level of 
generality, we consider a very simple setting, stipulating that the cardinality of the class of ranking 
rule candidates TZ under study is finite, #7^ = M < -|-oo, and that the optimal rule r* belongs to 


TZ. The following proposition is a simplified version of the fast rate result proved in Clemengon 


et al. (2008) for the empirical minimizer Tn = argminrgT^ Ln(r). 


Proposition 2. (Clemengon et al. (2008), Corollary Q) Suppose that Assumption^is fulfilled. 
Then, there exists a universal constant C > 0 such that for all 5 G (0,1), we have: Vn > 2, 


L(rn)-L(r*) < C 


^log(M/6) ^ 2 -a 


n 


(30) 


Consider now the minimizer tg of the incomplete U-statistic risk estimate 

1 

UB(Hr) = -}^ L ek((bj))I{(Yi-Yj)-r(X,,Xj)<0} 


(31) 


k=l <i<j<TL 


over TZ, where j)) indicates whether the pair (i, j) has been picked at the k-th draw (eic((i, j)) = 
1 in this case, which occurs with probability V©) (then, we set eic{(l, j)) =0). Observe 

that Tg also minimizes the empirical estimate of the excess risk Ag(r) = Ug(qr) over TZ. 

Theorem 3. Let a G [0,1] and suppose that Assumption^ is fulfilled. If we set B = 
there exists some constant C < -|-oo such that, for all 8 G (0,1), we have with probability at least 
1 -5.- Vn>2, 

/log(M/6]\ 


L(?g]-L(r*) < C 


1 

2—oc 




n 
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Sampling with 
replacement 


Sampling without 
replacement 


Bernoulli 

Sampling 


Figure 2: Illustration of different sampling schemes for approximating a U-statistic. For simplicity, 
consider again the case K = 1 and di = 2. Here n = 7 and the expected number of terms is B = 6 . 
Sampling with or without replacement results in exactly B terms, with possible repetitions when 
sampling with replacement, e.g. (xgjX/) in this example. In contrast, Bernoulli sampling with 
7 Ti = B/t^A results in B terms only in expectation, with individual realizations that may exhibit 
more or fewer terms. 


As soon as a < 1, this result shows that the same fast rate of convergence as that reached by 
Tn can be attained by the ranking rule tg, which minimizes an empirical version of the ranking risk 
involving the summation of terms solely. For comparison purpose, minimization of the 


criterion (28) computed with a number of terms of the same order leads to a rate bound of order 

Op(nV(2-V). 

Finally, we point out that fast rates for the clustering problem have been also investigated in 


Clemengon (2014), see Section 5.2 therein. The present analysis can be extended to the clustering 


framework by means of the same arguments. 


3.4 Alternative Sampling Schemes 

Sampling with replacement is not the sole way of approximating generalized U-statistics with a 
controlled computational cost. As proposed in Janson (1984), other sampling schemes can be 
considered, Bernoulli sampling or sampling without replacement in particular (see Figure]^ for an 
illustration). We now explain how the results of this paper can be extended to these situations. 
The population of interest is the set A and a survey sample of (possibly random) size b < n is any 
subset s of cardinality b = b(s) less than ^A in the power set V[A). Here, a general survey scheme 
without replacement is any conditional probability distribution R on the set of all possible samples 
s gV[A) given (Xi]igA- For any I G A, the hrst order inclusion probability 7 Ti(R) = Pr{I G S}, is 
the probability that the unit I belongs to a random sample S drawn from distribution R. We set 
7 t(R) = (7ri(R))igA- The second order inclusion probabilities are denoted by 7 rij(R) = Pr{(I, J) G S^} 
for any I 7 ^ J in A. When no confusion is possible, we omit to mention the dependence in R when 
writing the first/second order probabilities of inclusion. The information related to the observed 
sample S C A is fully enclosed in the random vector A = (A(I))igAj where A(I] = I{I G S} for all 
IgA. The 1-d marginal distributions of the sampling scheme An are the Bernoulli distributions 
with parameters 7 ti, I G A, and the covariance matrix of the r.v. An is given by F = {ttij — 7 ri 7 tj}j j 
with the convention ttj i = tti for all I G A. Observe that, equipped with the notations above. 
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LieA^(I) = 

One of the simplest survey plans is the Poisson scheme (without replacement), for which the 
A(l)’s are independent Bernoulli random variables with parameters tti, 1 G A, in (0,1). The hrst 
order inclusion probabilities fully characterize such a plan. Observe in addition that the size b(5) 
of a sample generated this way is random with expectation B = E[b(S) | (Xi)igA] = 

The situation where the Ttj’s are all equal corresponds to the Bernoulli sampling scheme: VI G A, 
7ti = The Poisson survey scheme plays a crucial role in sampling theory, inso far as a 

wide range of survey schemes can be viewed as conditional Poisson schemes, see Hajek (1964). 


For instance, one may refer to Cochran (1977) or Deville (1987) for accounts of survey sampling 
techniques. 


Following in the footsteps of the seminal contribution of Horvitz and Thompson (1951), an estl 


mate of 0 based on a sample drawn from a survey scheme R with first order inclusion probabilities 
(7ri)igA is given by: 


Uht(H) = — 


#A 


ISA 


7ti 


(32) 


with the convention that 0/0 = 0. Notice that it is an unbiased estimate of (§: 

E[Uht(H) I (Xi)i6a] =U„(H]. 

In the case where the sample size is deterministic, its conditional variance is given by: 


Var(UHT(H) | (Xi)i6a) = ^ 


¥1 ^ 


^ H(Xi) H(Xj) Y 

TTi Ttj ) 


(tTij — TTiTTj). 


We point out that the computation of (32) involves summing over a possibly random number of 


terms, equal to B = E[b(S)] = in average and whose variance is equal to Var(b(S)] = 

LISA Tti (1 -7ri) + £i^j{7tlj-7tl7tj}. 

Here, we are interested in the situation where the A(l)’s are independent from (Xi]igA, and 
either a sample of size B < ^A fixed in advance is chosen uniformly at random among the (^^) 
possible choices (this survey scheme is sometimes referred to as rejective sampling with equal hrst 
order inclusion probabilities), or else it is picked by means of a Bernoulli sampling with parameter 
B/^A. Observe that, in both cases, we have tti = B/^A for all 1 G A. The following theorem 
shows that in both cases, similar results as those obtained for sampling with replacement can be 


derived for minimizers of the Horvitz-Thompson risk estimate (32). 

Theorem 4. Let LL be a collection of bounded symmetric kernels on nLi that fulfills the 
assumptions involved in Proposition^ Let B G {1, ..., #A}. Suppose that, for any H G 77, 
UHT(bl) is the incomplete U-statistic based on either a Bernoulli sampling scheme with parameter 
B/^A or else a sampling without replacement scheme of size B. For all 8 G (0,1), we have with 
probability at least 1 — 6.- Vn G VB G {1, ..., #A}, 


ir, rue ,, /log(2(l + #A)V/5) , 2 log[l[] + #A)^/ 8 )Mu 

sup Uht(H) -Un(H) <2Mn\ -- h 

H&n V 

in the case of the Bernoulli sampling design, and 


B 


3B 


sup |Uht(H) 

Hen V b 
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in the case of the sampling without replacement design. 


We highlight the fact that, from a computational perspective, sampling with replacement is 
undoubtedly much more advantageous than Bernoulli sampling or sampling without replacement. 
Indeed, although its expected value is equal to B, the size of a Bernoulli sample is stochastic and 
the related sampling algorithm requires a loop through the elements I of A and the practical imple¬ 
mentation of sampling without replacement is generally based on multiple iterations of sampling 


with replacement, see Tille (2006). 


4 Application to Stochastic Gradient Descent for ERM 

The theoretical analysis carried out in the preceding sections focused on the properties of empirical 
risk minimizers but ignored the issue of finding such a minimizer. In this section, we show that 
the sampling technique introduced in Section also provides practical means of scaling up iterative 
statistical learning techniques. Indeed, large-scale training of many machine learning models, such 
as SVM, Deep Neural Networks or soft K-means among others, is based on stochastic 
gradient descent (SGD in abbreviated form), see Bottou ( |1998 ). When the risk is of the form ([^, 
we now investigate the benefit of using, at each iterative step, a gradient estimate of the form of 
an incomplete U-statistic, instead of an estimate of the form of a complete U-statistic with exactly 
the same number of terms based on subsamples drawn uniformly at random. 

Let 0 C with q > 1 be some parameter space and H : ]dk=i x 0 —> M be a loss function 

which is convex and differentiable in its last argument. Let 1 < k < K, be K 

independent random vectors with distribution on respectively such that the random 

vector H(X^^’, ..., X^^’, 
set 


y9<) 

) 1 


fKl 

I X)j^; 0) is square integrable for any 0 G 0. For all 0 G 0, 

L(0) = E[H(XS'’, ..., X'^’, ..., X'’"’, ..., 0)] = 0)) 

and consider the risk minimization problem min0g0L(0). Based on K independent i.i.d. samples 


X 


(k) 


with 1 < k < K, the empirical version of the risk function is 0 G 0 e-) Ln(0) '=' 


def 


bIn(H(-; 0)). Here and throughout, we denote by Ve the gradient operator w.r.t. 0. 


Gradient descent Many practical machine learning algorithms use variants of the standard 
gradient descent method, following the iterations: 


0t+l — 0t ~'ntV0Ln(0t), 


(33) 


with an arbitrary initial value 0o G 0 and a learning rate (step size) qt > 0 such that Bt = +oo 
and Bt < +oo. 

Here we place ourselves in a large-scale setting, where the sample sizes ni, ..., uk of the 
training data sets are so large that computing the gradient of Ln 


gn(0) = 


1 


nh, (;;) 


die'' Ii Ik 


(34) 


at each iteration (33) is computationally too expensive. Instead, Stochastic Gradient Descent uses 


an unbiased estimate g(0) of the gradient (34) that is cheap to compute. A natural approach 
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consists in replacing (34) by a complete U-statistic constructed from subsamples of reduced sizes 


<< rik drawn uniformly at random, leading to the following gradient estimate: 


gn'(0) = 


nLi G:) 


J^...}^VeH(xG;XgG..;xS|;’; 0), 


(35) 


where the symbol refers to summation over all subsets x|J^^ = , X^^) related to 

a set Ik of dk indexes 1 < ii < ... < idk < "n-k > ^k)- 

We propose an alternative strategy based on the sampling scheme described in Section]^ i.e. 
a gradient estimate in the form of an incomplete U-statistic: 

9b(0) = ^ L V0H(xSG 0], (36) 


where Pb is built by sampling with replacement in the set A. 

It is well-known that the variance of the gradient estimate negatively impacts on the conver¬ 
gence of SGD. Consider for instance the case where the loss function H is (1 /y)-smooth in its last 
argument, i.e. V 0 i ,02 G 0: 

||VeH(-; 0i]-V0H(-; 02 )|| < -|| 0 i - 02 ||. 

y 

Then one can show that if g is the gradient estimate: 

E[U(0t+i)] = E[L„(0t-Titg(0t))] 

< E[L„(0t)] -Tit||E[g„(0t)]f + 5^E[||g(0t]f] 

2y 

< E[Ln(0t)] -rit ^1 - E[||gn(0t)f] -k ^Var[g(0t)]. 


In other words, the smaller the variance of the gradient estimate, the larger the expected reduction 
in objective value. Some recent work has focused on variance-reduction strategies for SGD when 


the risk estimates are basic sample means (see for instance Le Roux et ah, 2012; Johnson and 


Zhang, 2013). 


In our setting where the risk estimates are of the form of a U-statistic, we are interested 
in comparing the variance of gn'(0) and gB(0) when B = Ok=i (aG their computation 

requires to average over the same number of terms and thus have similar computational costj^ Our 
result is summarized in the following proposition. 

Proposition 3. Let B = Ok=i (aj)) 'n-k 'T'-k; k = 1 , ..., K. In the asymptotie framework 
we have: 


Var[gn/(0)] = O 
as u' = n| -k ... -k n(. —> -koo. 




Var[gB(0)] = O 


nti (j) 


^Note that sampling B sets from A to obtain 
X{k...,ri|j} for each k = 1,..., K and then forming all combinations to obtain (35 I. 


is potentially more efficient than sampling n(. points from 
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Proposition shows that the convergence rate of Var[gB(0]] is faster than that of Var[gn/(0)] 
except when K = 1 and di = 1. Thus the expected improvement in objective function at each 


SGD step is larger when using a gradient estimate in the form of (36) instead of (35), although 


both strategies require to average over the same number of terms. This is also supported by the 
experimental results reported in the next section. 

5 Numerical Experiments 

We show the benefits of the sampling approach promoted in this paper on two applications: metric 
learning for classification, and model selection in clustering. 


5.1 Metric Learning 


In this section, we focus on the metric learning problem (see Section 2.2.2). As done in much of 
the metric learning literature, we restrict our attention to the family of pseudo-distance functions 
Dm : X —> M+ defined as 


where M G S+, and is the cone of d x d symmetric positive-semidefinite (PSD) matrices. 

Given a training sample {(xi,yi)}lhi where Xi G and yt G {1,..., C}, let yy = 1 if y^ = yj 
and 0 otherwise for any pair of samples. Given a threshold b > 0, we define the empirical risk as 
follows: 


Rn(DM) — 


nfn — 1 


Y_ [yij(b-DM(Xi,Xj))] 
1 <i<j<n 


(37) 


where [u]+ = max(0,1 — u) is the hinge loss. This risk estimate is convex and was used for instance 


by Jin et al. (2009) and Gao et al. (2012). Our goal is the find the empirical risk minimizer among 


our family of distance functions, i.e.: 


M = argminRn(DM) 

MeSj 


(38) 


In our experiments, we use the following two data sets: 


Synthetic data set: some synthetic data that we generated for illustration. X is a mixture 
of 10 gaussians in - each one corresponding to a class - such that all gaussian means are 
contained in an subspace of dimension 15 and their shared covariance matrix is proportional 
to identity with a variance factor such that some overlap is observed. That is, the solution to 
the metric learning problem should be proportional to the linear projection over the subspace 
containing the gaussians means. Training and testing sets contain respectively 50,000 and 
10,000 observations. 

MNIST data set: a handwritten digit classification data set which has 10 classes and 
consists of 60,000 training images and 10,000 test images^ This data set has been used 
extensively to benchmark metric learning ( Weinberger and Saul[ 2009). As done by previous 
authors, we reduce the dimension from 784 to 164 using PGA so as to retain 95% of the 
variance, and normalize each sample to unit norm. 


^http: //ycLiin. lecun. com/exdb/mnist/ 
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(a) Synthetic data set (b) MNIST data set 

Figure 3: Test risk with respect to the sample size p of the ERM when the risk is approximated 
using complete (blue) or incomplete (red) U-statistics. Solid lines represent means and dashed 
ones represent standard deviation. For the synthetic data set, the green dotted line represent the 
performance of the true risk minimizer. 


Note that for both datasets, merely computing the empirical risk (37) for a given M. involves 
averaging over more than 10^ pairs. 

We conduct two types of experiment. In Section [5.1.H we subsample the data before learning 
and evaluate the performance of the ERM on the subsample. In Section 5.1.2, we use Stochastic 
Gradient Descent to find the ERM on the original sample, using subsamples at each iteration to 
estimate the gradient. 


5.1.1 One-Time Sampling 

We compare two sampling schemes to approximate the empirical risk: 


• Complete U-statistic: p indices are uniformly picked at random in {1, ... ,n}. The empirical 

risk is approximated using any possible pair formed by the p indices, that is pairs. 

• Incomplete U-statistic: the empirical risk is approximated using pairs picked uniformly 

at random in {1,..., n}^. 


For each strategy, we use a projected gradient descent method in order to solve (38), using 
several values of p and averaging the results over 50 random trials. As the testing sets are large, 
we evaluate the test risk on 100,000 randomly picked pairs. 

Figure |3(a) shows the test risk of the ERM with respect to the sample size p for both sampling 
strategies on the synthetic data set. As predicted by onr theoretical analysis, the incomplete U- 
statistic strategy achieves a significantly smaller risk on average. For instance, it gets within 5% 
error of the true risk minimizer for p = 50, while the complete U-statistic needs p > 80 to reach 
the same performance. This represents twice more computational time, as shown in Figure 4(a) (as 
expected, the runtime increases roughly quadratically with p). The incomplete U-statistic strategy 
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(a) Synthetic data set (b) MNIST data set 

Figure 4: Average training time (in seconds) with respect to the sample size p. 


also has the advantage of having a much smaller variance between the runs, which makes it more 
reliable. The same conclusions hold for the MNIST data set, as can be seen in Figure |3(b)| and 
Figure [4(b)[ 


5.1.2 Stochastic Gradient Descent 


In this section, we focus on solving the ERM problem (38) using Stochastic Gradient Descent and 


compare two approaches (analyzed in Section]^ to construct a mini-batch at each iteration. The 
first strategy, SGD-Complete, is to randomly draw (with replacement) a subsample and use the 
complete U-statistic associated with the subsample as the gradient estimate. The second strategy, 
SGD-Incomplete (the one we promote in this paper), consists in sampling an incomplete U-statistic 
with the same number of terms as in SGD-Complete. 


For this experiment, we use the MNIST data set. We set the threshold in (37) to b = 2 and 
the learning rate of SGD at iteration t to r|t = 1/(1101) where po G {1,2.5,5,10,25,50}. To reduce 
computational cost, we only project our solution onto the PSD cone at the end of the algorithm, 
following the “one projection” principle used by Chechik et al. (2010). We try several values m 


for the mini-batch size, namely m G (10,28,55, 105,253}j^ For each mini-batch size, we run SGD 
for 10,000 iterations and select the learning rate parameter rjo that achieves the minimum risk on 
100,000 pairs randomly sampled from the training set. We then estimate the generalization risk 
using 100,000 pairs randomly sampled from the test set. 

For all mini-batch sizes, SGD-Incomplete achieves significantly better test risk than SGD- 
Complete. Detailed results are shown in Figure for three mini-batch sizes, where we plot the 
evolution of the test risk with respect to the iteration number]^ We make several comments. 
First, notice that the best learning rate is often larger for SGD-Incomplete than for SGD-Complete 


^For each m, we can construct a complete U-statistic from n' samples with n'[n' — 1)/2 = m terms. 

^We point out that the figures look the same if we plot the runtime instead of the iteration number. Indeed, the 
time spent on computing the gradients (which is the same for both variants) largely dominates the time spent on the 
random draws. 
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Figure 5: SGD results on the MNIST data set for various mini-batch size m. The top row shows 
the means and standard deviations over 50 runs, while the bottom row shows each run separately. 


(m = 10 and m = 253). This confirms that gradient estimates from the former strategy are gen¬ 
erally more reliable. This is further supported by the fact that even though larger learning rates 
increase the variance of SGD, in these two cases SGD-Gomplete and SGD-Incomplete have similar 
variance. On the other hand, for m = 55 the learning rate is the same for both strategies. SGD- 
Incomplete again performs significantly better on average and also has smaller variance. Lastly, 
as one should expect, the gap between SGD-Gomplete and SGD-Incomplete reduces as the size 
of the mini-batch increases. Note however that in practical implementations, the relatively small 
mini-batch sizes (in the order of a few tens or hundreds) are generally those which achieve the best 
error/time trade-off. 


5.2 Model Selection in Clustering 


In this section, we are interested in the clustering problem described in Section 2.2.1 Specifically, 
let Xi, ..., Xn G be the set of points to be clustered. Let the clustering risk associated with a 
partition V into M groups Ci,..., Cm be: 


M 


Wn[P] = 


nfn-1 


D(Xt,Xj) •I{(Xt,Xj) eC^}. 

m=l 1 <i<j<Ti 


(39) 


In this experiment, given a set of candidate partitions, we want to perform model selection by 


picking the partition which minimizes the risk (39) plus some term penalizing the complexity of 


the partition. When the number of points u is large, the complete risk is very expensive to com¬ 
pute. Our strategy is to replace it with an incomplete approximation with much fewer terms. Like 


22 




















































10 


6 

w 

■qC 4 


Complete U-Statistic 
Incomplete U-Statlstic (B=5000) 


5 10 15 

Number of clusters 

(a) Risk 


20 


10 


4^ 

to 


■D 

o; 


03 

c 

cu 

Q_ 


Complete U-Statistic 
Incomplete U-Statistic (B=5000) 


• • 


*••••••< 






5 10 15 

Number of clusters 

(b) Penalized risk 


20 


Figure 6: Clustering model selection results on the forest cover type data set. Figure 6(a) shows the 
risk (complete and incomplete with B = 5,000 terms) for the first 20 partitions, while Figure [6(b)| 
shows the penalized risk for c = 1.1. 


in the approach theoretically investigated in Section 3.2, the goal here is to show that using the 


incomplete approximation instead of the complete version as the goodness-of-fit measure in a com¬ 
plexity penalized criterion does not damage the selection, while reducing the computational cost. 
For simplicity, the complexity penalty we use below is not of the same type as the structural VC 
dimension-based penalty considered in Theorem but we will see that the incomplete approxima¬ 
tion is very accurate and can thus effectively replace the complete version regardless of the penalty 
used. 

The experimental setup is as follows. We used the forest cover type data set which is popular 
to benchmark clustering algorithms (see for instance Kanungo et ah, 2004). To be able to evaluate 
the complete risk, we work with n = 5,000 points subsampled at random from the entire data set 
of 581,012 points in dimension 54. We then generated a hierarchical clustering of these points using 
agglomerative clustering with Ward’s criterion (Ward, 1963) as implemented in the scikit-learn 
Python library (Pedregosa et al., 2011). This defines n partitions Pi,..., Vn where Pra consists of 
m clusters (Pi corresponds to a single cluster containing all points, while in Pn each point has its 
own cluster). 

For each partition size, we first compare the value of the complete risk (39) with n(n — 1) = 
24,995,000 terms with that of an incomplete version with only B = n = 5,000 pairs drawn at 
random. As shown in Figure 6(a), the incomplete U-statistic is a very accurate approximation of 


the complete one, despite consisting of 5000 times less terms. It will thus lead to similar results in 
model selection. To illustrate, we use a simple penalty term of the form pen(Pra) = cTog(m) where 
c is a scaling constant. Figure [6(b)| shows that both selection criteria choose the same model Pg. 
Performing this model selection over Pi,..., P 20 took about 66 seconds for the complete U-statistic, 
compared to only 0.1 seconds for the incomplete version]^ 


“https://archive.ics.uci.edu/ml/datasets/Covertype 

®The n X n distance matrix was precomputed before running the agglomerative clustering algorithm. The associ¬ 
ated runtime is thus not taken into account in these timing results. 
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Finally, we generated 100 incomplete U-statistics with different random seeds ; all of them 
correctly identified "Pg as the best model. Using B = 5,000 pairs is thus sufficient to obtain reliable 
results with an incomplete U-statistic for this data set. In contrast, the complete U-statistics based 
on a subsample (leading to the same number of pairs) selected the correct model in only 57% of 
cases. 


6 Conclusion 

In a wide variety of statistical learning problems, U-statistics are natural estimates of the risk 
measure one seeks to optimize. As the sizes of the samples increase, the computation of such func¬ 
tionals involves summing a rapidly exploding number of terms and becomes numerically unfeasible. 
In this paper, we argue that for such problems. Empirical Risk Minimization can be implemented 
using statistical counterparts of the risk based on much less terms (picked randomly by means of 
sampling with replacement), referred to as incomplete \l-statistics. Using a novel deviation in¬ 
equality, we have shown that this approximation scheme does not deteriorate the learning rates, 
even preserving fast rates in certain situations where they are proved to occur. Furthermore, we 
have extended these results to U-statistics based on different sampling schemes (Bernoulli sam¬ 
pling, sampling without replacement) and shown how such functionals can be used for the purpose 
of model selection and for implementing ERM iterative procedures based on stochastic gradient 
descent. Beyond theoretical rate bounds, the efficiency of the approach we promote is illustrated 
by several numerical experiments. 
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A Proof of Proposition 
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for any H G 77 Recall that the K-sample U-statistic Un(H) can be expressed as 
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where Sttl denotes the symmetric group of order m for any m > 1. This representation as an 
average of sums of N independent terms is known as the (first) Hoeffding’s decomposition, see 


24 





Hoeffding (1948). Then, using Jensen’s inequality in particular, one may easily show that, for any 
nondecreasing convex function : M+ — 
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( sup Un(H] ) 
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. VHeH 


we have: 


V rvld vd) 


(41) 


arguments (see Gine and Zinn (1984) for instance) and (41), we obtain that 


where we set H = H — for all H € Ti. Now, using standard symmetrization and randomization 

(42) 


E 


sup |Un(H)| 
He-H 


< E [4) (27^N)], 


where 
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Hen 


1=1 


N .Zl (^(l-l)di+l > • • • > ^Idi > • • ■ > ^(l-l)dK+l’ ■ • • ’ ^ 


dK) 

IdK 


is a Rademacher average based on the Rademacher chaos ei, ..., cn (independent random sym¬ 
metric sign variables), independent from the xj'^^’s. We now apply the bounded difference in¬ 
equality (see [McDiarmid (1989)) to the functional 72 n, seen as a function of the i.i.d. random 
variables (g, x{^j,), 1 < 1 < N: changing any of these 
random variables change the value of 72 .n by at most One thus obtains from (42) with 

r[)(x] = exp(Ax), where A > 0 is a parameter which shall be chosen later, that: 


E 


exp ( A sup Un(H) 
He-H 


< exp ^2AE[72 n] + 


Applying Chernoff’s method, one then gets: 


sup |Un(H)| > ri < exp ( —Aq -|- 2AE[72n] -|- 
Hen I V 4N 


(43) 


(44) 


Using the bound (see Eq. (6) in Boucheron et ah (2005) for instance) 


2viog(i +n; 


N 


E[72n] < Mn^ 

and taking A = 2N(q — 2E[72 n])/A 4^ in (44), one finally establishes the desired result. 


B Proof of Theorem [I] 

For convenience, we introduce the random sequence C = ((Ck(I))ieA)i<k<B) where Ck(I) is equal to 
1 if the tuple I = (Ii, ..., R) has been selected at the k-th draw and to 0 otherwise: the Ck’s are 
i.i.d. random vectors and, for all (k, I) G {1, ..., B} x A, the r.v. Ck(I) has a Bernoulli distribution 
with parameter 1/#A. We also set Xj = x||^^) for any I in A. Equipped with these 

notations, observe first that one may write: VB > 1, Vn G 

1 

Ub(H)-U„(H) = 

° k=l 
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where ~ V#^)H(Xi) for any (k, I) G {1, ..., B} x A. It follows from the 

independence between the Xj’s and the C(I)’s that, for all H £ "H, conditioned upon the Xj’s, 
the variables Z-q[W) are independent, centered and almost-surely bounded by IMu 

(notice that ~ ^ h > 1). By virtue of Sauer’s lemma, since % \s a. VC major 

class with finite VC dimension V, we have, for fixed Xj’s; 

#{(H(Xi))igA: (1 


Hence, conditioned upon the Xi’s, using the union bound and next Hoeffding’s inequality applied 
to the independent sequence Z-\ (H), ..., Zb(H), for all q > 0, we obtain that: 


sup 

Hew 




>11 I (Xi)igA > < 
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k=l 


> B I (Xi)igA 




Taking the expectation, this proves the first assertion of the theorem. Notice that this can be 
formulated: for any 6 G (0,1), we have with probability at least 1 — 6: 


sup 

Hew 


Ub(H)-U„(H) 


< M-h X \/2 


,Vlog(l + #A) +log(2/6) 


B 


Turning to the second part of the theorem, it straightforwardly results from the first part 
combined with Proposition!^ 


C Proof of Corollary 

Assertion (i) is a direct application of Assertion (ii) in Theorem combined with the bound 
M,(Hb) - infHew li(H) < 2supHg-H - p(H)|. 

Turning next to Assertion (li), observe that by triangle inequality we have: 
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(45) 


The same argument as that used in Theorem [T (with 
the second term on the right hand side of Eq. (4M: 


u for any u > 0) yields a bound for 
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sup |Un(H) - 
Hew 


< IM-T-i 


2Vlog(l +N) 
N ' 


(46) 


The first term can be controlled by means of the following lemma, whose proof can be found for 
instance in Lugosi (2002 Lemmas 1.2 and 1.3). 


Lemma 1. The following assertions hold true. 


(i) Hoeffding’s lemma. Let Z be an integrable r.v. with mean zero such that a < Z < b almost- 
surely. Then, we have: Vs > 0 

E[exp(sZ)] < exp ^s^(b — a)^/8^ . 
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(a) Let M > 1 and Zi, ..., Zm be real valued random variables. Suppose that there exists a > 0 
such that Vs G M; E[exp(sZi)] < for alii € , M}. Then, we have: 
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max IZil 
l<i<M 


< aY^21og(2M). 


(47) 


Assertion (i) shows that, since —Ain < -Zk(H) < Ain almost surely, 
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With a = and M = : H G 77} < (1 + #A)^, conditioning upon (Xi)igA, this 

result yields: 
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(48) 


Integrating next over (Xi)igA and combining the resulting bound with (45) and (46) leads to the 
inequality stated in (it). 

A bound for the expected value. For completeness, we point out that the expected value of 
supi-ig-^ 1(1/B) Zi^(H)| can also be bounded by means of classical symmetrization and ran¬ 

domization devices. Considering a ’’ghost” i.i.d. sample Cj, • • ■, Cb independent from ((Xi)igA, C), 
distributed as C, Jensen’s inequality yields: 
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Introducing next independent Rademacher variables ei, ..., Gb, independent from ((Xi)igA, C, 70; 
we have: 
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We thus obtained: 
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D Proof of Theorem [2] 


We start with proving the intermediary result, stated below. 

Lemma 2. Under the assumptions stipulated in Theorem^ we have: Vm > 1, Ve > 0, 
pj^su^p |^(H) - Ub(H)| > 2Mn^ I + #A)) I ^ 

<exp(-B^eV (2(B + n)At|,J). 

Proof. This is a direct application of the bounded difference inequality (see McDiarmid (1989)) 
applied to the quantity sup^g-^^ Im-(H) — Ub(H)|, viewed as a function of the (B + n] independent 

random variables Xn^\ei, eb) (jumps being bounded by lAiu/B), combined with 

Assertion (ii) of Corollary □ 

Let m > 1 and decompose the expected excess of risk of the rule picked by means of the 
complexity regularized incomplete U-statistic criterion as follows: 


McDiarmid (1989 
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- Ub (HB,fii) - pen(B, in 


,m^ 


+ E 


inf |uB(HB,j) +pen(B,j)| 


where we set = infHe-Hm b(H)- In order to bound the first term on the right hand side of the 
equation above, observe that we have: Ve > 0, 
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using successively the union bound and Lemma Integrating over [0,+oo), we obtain that: 

, V'27t(B + n] 
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(49) 


Considering now the second term, notice that 
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inf |uB(HB,j) +pen(B,j)| 


< E 


klB(HB,m)+ pen(B,m) - < pen(B,m). 
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Combining the bounds, we obtain that: Vm > 1, 
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< + pen(B, nv) + M 


•\/27r(B~+n) 
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The oracle inequality is thus proved. 


E Proof of Theorem [3] 


We start with proving the following intermediary result, based on the U-statistic version of the 
Bernstein exponential inequality. 

Lemma 3. Suppose that the assumptions of Theorem^ are fulfilled. Then, for all 8 G (0,1], we 
have with probability larger than 1 — 5.- Vr G 7^, Vn > 2, 


0 < An(r] 


A(r) + 


2cA(r)“log(#72/6) 


n 


+ 


41og(#72/6) 

3n 


Proof. The proof is a straightforward application of Theorem A on p. 
combined with the union bound and Assumption!^ 


201 in Serfling (1980), 

□ 


The same argument as that used to prove Assertion (i] in Theorem (namely, freezing the 
Xi’s, applying Hoeffding inequality and the union bound) shows that, for all 6 G (0,1), we have 
with probability at least 1 — 6: Vr G 72, 


. r; . ^ . /M + log(M/5] 

0 < UB(qr] -Un(qr) + y -^- 

for all Ti > 2 and B > 1 (observe that My, < 1 in this case). Now, combining this bound with the 
previous one and using the union bound, one gets that, for all 6 G (0,1], we have with probability 
larger than 1 — 5: Vr G 72, Vn > 2, VB > 1, 

f Ar . , /2cA(r]«log(2M/6) , 41og(2M/5] , /M + log(2M/6) 

0<UB(q,)-A(r) + y- + ^ - - -. 

Observing that, UB(qyg) < 0 by definition, we thus have with probability at least 1 — 6: 


A(rB] < 


2cA(rB)°^log(2M/S] 41og(2M/6) 
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3n 


M + log(2M/5) 

' B ■■ 


Choosing finally B = “^], the desired result is obtained by solving the inequality above for 

A(rB). 


F Proof of Theorem |4] 


As shown by the following lemma, which is a slight modification of Lemma 1 in Janson (1984), the 
deviation between the incomplete U-statistic and its complete version is of order Op)(l/\/B] for 
both sampling schemes. 
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Lemma 4. Suppose that the assumptions o/j^ are fulfilled. Then, we have: VH G TL, 
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(Uht(H) - I (Xi]i6a1 < 


Proof. Observe first that, in both cases (sampling without replacement and Bernoulli sampling), 
we have: VI 7 ^ J in A, 
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Hence, as (A(I)]igA and (Xi)igA are independent by assumption, we have: 
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Consider first the case of Bernoulli sampling. By virtue of Bernstein inequality applied to the 
independent variables (A(I) — B/^A)H(Xi) conditioned upon (Xi)igA) we have: VH G 71,^1 > 0, 


J^(A(I)-B/#A)H(Xi) 
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>t| fX 
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< 2 exp 


4BMj^ + 2Mnt/3 


Hence, combining this bound and the union bound, we obtain that: Vt > 0, 


sup |Uht(H) -Un(H) 
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> t I (X,)„a} < 2(1 + #A)''exp (^- . 


Solving 


6 = 2(1 + #A)'^exp ( — 


BH 


4M^ + 2Mnt/3) 


yields the desired bound. 

Consider next the case of the sampling without replacement scheme. Using the exponential 
inequality tailored to this situation proved in Serfiing (1974) (see Corollary 1.1 therein), we obtain: 
VH Gn,Vt> 0, 
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The proof can be then ended using the union bound, just like above. 


30 





























G Proof of Proposition 


For simplicity, we focus on one sample U-statistics of degree two (K = 1, di =2) since the argument 
easily extends to the general case. Let 11^(14) be a non-degenerate U-statistic of degree two: 


Un(H) = —-— y 

n n — 1 ^' 


>•<) 


In order to express the variance of 11^(11) based on its second Hoeffding decomposition (see Sec- 
2.1), we first introduce more notations: V(x, x') G 


tion 


Hi(x) =^E[H(x,X)] and H 2 (x,x') “= H(x, x') - ^(H) - Hi (x) - Hi (x'). 


/•i def 


Equipped with these notations, the (orthogonal) Hoeffding/Hajek decomposition of UnlH) can be 
written as 

Un(H) = q(H] + 2Tn(H) + Wn(H), 
involving centered and decorrelated random variables given by 


Tn(H) = -VHi(xO, 

n ^— 

i=l 

Wn(H) = ^^H2(Xi,Xj). 


nfri — 1 


t<i 


Recall that the U-statistic Wn_(H) is said to be degenerate, since E[H 2 (x, X)] = 0 for all x G d^i. 
Based on this representation and setting = Var[Hi (X)] and = Var[H 2 (X, X']], the variance of 
UTt(H) is given by 

4 fT^ 2 

Var[Un(H)] = ^ + ^ (50) 
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nfn-L 


As already pointed out in Section 3U, the variance of the incomplete U-statistic built by sampling 
with replacement is 


Var[UB(H)] = Var[Un(H)] + ^ f 1 - —- 

B V rL(n- 1 

= Var[Un(H)] + l fl^ 
B V n n — 1 


Var[H(X,X')] 
i2(x^ + aj]. 


(51) 


Take B = u' (n' — 1 ] for n' ^ n. It follows from (50) and (51) that in the asymptotic framework 


Q, the quantities Var[Un'(H)] and Var[UB(H)] are of the order 0(1 /n') and 0(1 /n'^) respectively 
as n' —> -|-oo. Hence these convergence rates hold for gn'(0) and gB(0] respectively. 
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